Homework 1 Instructions

DKU Stats 101 Spring 2024 Session 4

Author

Anonymous

Published

March 25, 2024

Part 1: One variable analysis

Q1: What kind of dataset do we have? (5 points)

  • According to the definitions in the textbook, describe the Five W’s for this dataset.
  • Who : Boats for sale
  • What : Features and characteristics of the boats
  • When : 2019
  • Where: BoatTrader.com / Kaggle
  • Why : Identify the key features of boats that best determine their price
  • Using the definitions in the textbook, describe the variable type for the following variables:

    • id
    • type
    • boatClass
    • year
    • condition
    • length_ft
    • beam_ft
    • dryWeight_lb
    • price
    • sellerId
  • Categorical variables : type, boatClass, condition
  • Ordinal variables: None
  • Identifiers : id, sellerId
  • Quantitative variables: year, length_ft, beam_ft, dryWeight_lb, price

Note: year is also acceptable as a categorical variable

Points of emphasis:

  • You don’t need to know all of the details for where the dataset came from for the Five Ws but you do need to categorize all the variables properly.

Q2: Literature review (5 points)

Find a news article online that discusses the major features people look for when buying a boat, and include the link in this section.

Based on the article and your own personal expectations, what are some ways we might expect the data to be distributed or variables related? Make a list of at least three things we should expect or look for in the data and write a reason why we should expect it (no need to cite academic papers, just write down your reasons). Reasons should be thoughtful and at least two sentences explaining your logic for the expectation.

Points of emphasis:

  • The article must deal with boats and expectations about the variables. Logic about expectations must be coherent.

Q3: Describing the data (10 points)

The first step in analyzing any dataset is doing some exploratory analysis of the variables.

  • Make a histogram of price.

Figure 1: Price histogram
  • Describe it using the three features of quantitative data.
  • Shape: unimodal, asymmetric, and right skewed, with several outliers at high values.

  • Center: mean price is 71644, median is 35680; as you can see, the mean is being “pulled” to higher values due to the right skew.

  • Spread: IQR is 39952.5, so the middle half of the observations spans a range of 39952.5 dollars. The standard deviation is 153042, nearly four times the IQR, because the standard deviation is affected by extreme outliers. This also indicates a distribution with a skew or outliers.

  • Does the histogram of price surprise you? Why or why not?
  • Not really. A right skew makes sense because price cannot go below zero but has no upper limit, and variables like this are usually right skewed. Maybe the extent of the skew surprises me a bit.
  • Which is a better measure of center of the histogram, mean or median?
  • Depends on what you are using the measure of center for, but for most cases, due to the right skew, the median is a better measure of center.
  • Make a nice table displaying the 5 number summary. You can make the table using either kable or the visual editing mode of Quarto. Calculate the five number summary using the min(), quantile(), median(), and max() functions. Show your code in the document (echo: true).
kable(boats %>%
        summarise(min(price),  quantile(price, probs=0.25), median(price), quantile(price, probs=0.75), max(price)), 
      col.names = c("Min", "25%", "Median", "75%", "Max"))
Table 1: Price 5 number summary
Min 25% Median 75% Max
519 19995 35680 59947.5 1799900

There are quite a few other ways to generate this result, the above is just one example.
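
As one such alternative, the sketch below uses base R only. The price vector is made-up illustration data, not the boats dataset, and quantile() is left at its default interpolation method, so results on real data should match the quantile() calls above.

```r
# Hypothetical prices in dollars (illustration only, not the boats data)
price <- c(12000, 18000, 25000, 36000, 52000, 75000, 240000)

# A single quantile() call returns all five values at once
five_num <- quantile(price, probs = c(0, 0.25, 0.5, 0.75, 1))
names(five_num) <- c("Min", "25%", "Median", "75%", "Max")

five_num
```

The named vector can then be formatted as a table in whatever way the rest of the report uses.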

  • Calculate the standard deviation using the sd() function. Interpret it - is it large or small? How does it compare to the IQR? What does this tell you about the shape of the distribution?

The standard deviation (sd) measures how far values typically fall from the mean, so it describes the spread of the distribution and is usually reported alongside the mean. The result of sd() is the square root of the variance and has the same units as the original data, but it can be strongly affected by outliers or skew. In this case, the standard deviation is $153042, which is quite large given that the mean is $71644. The middle half of the data lies between roughly $20,000 and $60,000 (an IQR of about $40,000), so the standard deviation is nearly four times the IQR. This is the influence of the outliers.
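
To see why outliers inflate the standard deviation but barely move the IQR, here is a small sketch on made-up prices (not the boats data). The names typical and with_yacht are hypothetical and exist only for this illustration.

```r
# Hypothetical prices in dollars (illustration only)
typical    <- seq(20000, 60000, by = 5000)  # nine evenly spaced prices
with_yacht <- c(typical, 1800000)           # same prices plus one extreme listing

# Compare the two spread measures with and without the extreme value
spread <- data.frame(
  data = c("typical", "with_yacht"),
  sd   = c(sd(typical), sd(with_yacht)),
  iqr  = c(IQR(typical), IQR(with_yacht))
)
spread
```

Adding one extreme listing changes the IQR only slightly, while the standard deviation grows many times over. This is the same pattern seen in the price variable.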

  • Would this histogram benefit from a transformation, in your opinion? Why or why not? If it would, please transform it appropriately, make a new histogram, and describe the transformation.

Figure 2: Re-expression of price

Since the data is right skewed, we need a lower-order transformation, such as log or square root. Taking the log of price significantly increases the symmetry of the distribution. Even after taking the log, however, we can see there are still some outliers on the low and high side that are worth investigating.
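
The effect of the log can be checked numerically as well as visually. The sketch below builds an idealized right-skewed price vector from evenly spaced lognormal quantiles; the parameters 10.5 and 1 are arbitrary assumptions, not estimates from the boats data.

```r
# Idealized right-skewed prices: evenly spaced quantiles of a lognormal
# distribution (parameters chosen for illustration only)
price <- qlnorm(ppoints(500), meanlog = 10.5, sdlog = 1)
log_price <- log10(price)

# In a skewed distribution the mean sits well above the median;
# after the log transformation the two nearly coincide
raw_gap <- mean(price) - median(price)
log_gap <- mean(log_price) - median(log_price)

c(raw_gap = raw_gap, log_gap = log_gap)
```

On real data the gap will not vanish completely, but a much smaller gap between mean and median after the transformation is a quick sign that symmetry has improved.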

  • Make a boxplot chart comparing the median of price according to the variable condition. If you previously transformed your data, keep it transformed for this step. Interpret this graph.

Figure 3: Distribution of price by condition

This graph seems to indicate that there is not a substantial difference between the price distributions of new and used boats, though there are more outliers among new boats (in both directions) than among used boats. The IQR of used boats is larger (shown by the larger box), indicating that perhaps there is more variation among ‘average’ cases but fewer outliers. While the IQR of used boats being larger is not surprising (used boats may vary significantly in quality and age), the difference in outliers is surprising and worth further investigating.

Points of emphasis:

  • Well labeled graphs, with appropriate (not variable name) names for the x and y axes.
  • Legend labeled
  • Graphs that contain the correct amount of information
  • Reasonable, thoughtful interpretations of the requested statistics, not just one or two word answers.
  • Correct results for the requested statistics

Q4: Comparing categorical variables (5 points)

  • One interesting piece of information the dealer would like to know is if there is any change in the types of boats being sold according to what type of fuel it uses. In particular, they wonder if some fuel types are becoming less popular and therefore more difficult to sell in new boats. The best way to examine this relationship is with a contingency table.
Table 2: Contingency table of fuel type on boat condition
       (not listed)  diesel  gasoline  other
new              52       1        62    101
used              2      22        75     72

We can see from this table that many new boats are gasoline powered (62) while almost none are diesel powered (1, compared with 22 used diesel boats), indicating that perhaps the trend is shifting away from diesel. However, most boats are in the “other” category or do not have a fuel type listed.

  • Add margins to your table. Does it change your interpretation?
Table 3: Contingency table w/margins of fuel type on boat condition
       (not listed)  diesel  gasoline  other  Sum
new              52       1        62    101  216
used              2      22        75     72  171
Sum              54      23       137    173  387

Adding margins to the table allows us to see the total count for each row and column, and makes it more clear that the main fuel type is other or not listed. We can also see there are more new boats than used boats.

  • Now convert your table into a proportions table. Does this better help explain what the data show?
Table 4: Proportions table of fuel type on boat condition
       (not listed)  diesel  gasoline  other
new            0.24    0.00      0.29   0.47
used           0.01    0.13      0.44   0.42

A proportion table helps emphasize that most cases are either new boats in the “other” category or used boats in the gasoline category.
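
The three tables in this question can be reproduced from the counts alone. The sketch below types in the counts from Table 2 instead of reading the boats data, and it assumes the unlabeled first column holds listings with no fuel type; the label "not listed" is introduced here only for illustration.

```r
# Counts copied from Table 2; "not listed" is an assumed label for the
# column that has no header in the printed table
counts <- matrix(c(52,  1, 62, 101,
                    2, 22, 75,  72),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(condition = c("new", "used"),
                                 fuelType  = c("not listed", "diesel",
                                               "gasoline", "other")))
fuel_tab <- as.table(counts)

# Row and column totals
with_margins <- addmargins(fuel_tab)

# Proportions within each condition (each row sums to 1)
row_props <- round(prop.table(fuel_tab, margin = 1), 2)

with_margins
row_props
```

Using margin = 1 gives proportions within each row, which matches Table 4. Omitting the margin argument would instead divide every cell by the grand total of 387.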

  • What should you recommend to the dealer? What additional questions or areas of research do the results suggest you should look into?

I would suggest doing further research on the other category and investigate some of the unlisted fuel types - it appears that standard fuel types (such as gasoline) are becoming less common. It may be that alternative fuels are on the rise.

Points of emphasis:

  • Reasonable, thoughtful interpretations of the requested statistics, not just one or two word answers.
  • Correct results for the requested statistics

Q5: Understanding and comparing distributions (5 points)

  • Using the five number summaries, calculate if length_ft has any outliers according to the rule described in the textbook for outliers in boxplots. Show your calculations. Do you believe the outliers identified are real outliers? Why or why not? Consider the purpose of your report when preparing your answer.
kable(boats %>%
        summarise(min(length_ft),  
                  quantile(length_ft, probs=0.25), 
                  median(length_ft), 
                  quantile(length_ft, probs=0.75), 
                  max(length_ft)), 
      col.names = c("Min", "25%", "Median", "75%", "Max"))
Table 5: Length (ft) 5 number summary
Min 25% Median 75% Max
1 18.58 21.5 25 120

Boxplot calculations for length_ft:

length_ft_med <- median(boats$length_ft)
length_ft_lq <- quantile(boats$length_ft, probs=0.25)
length_ft_uq <- quantile(boats$length_ft, probs=0.75)
length_ft_iqr <- IQR(boats$length_ft)
length_ft_uf <- length_ft_uq + 1.5*length_ft_iqr
length_ft_lf <- length_ft_lq - 1.5*length_ft_iqr
  • \(median=21.5\)
  • \(IQR=25-18.58=6.42\)
  • \(Upper\,fence=25+1.5\cdot6.42=34.63\)
  • \(Lower\,fence=18.58-1.5\cdot6.42=8.95\)
  • According to this rule, there are upper and lower outliers of length_ft.
Table 6: Outlier identities for length (ft)
ID Length (ft) Make Model Price
7243489 120.00 Sea-Doo Spark 4999
6985357 76.00 Sunseeker 76 Yacht 1483185
5015919 75.00 Leopard Express Motor Yacht 949000
6222534 72.00 Hatteras 72 Motor Yacht 349000
6949685 70.00 Kong & Halvorsen 70 Cockpit Motoryacht 785000
6504838 62.00 Custom 62 Breen Custom Aluminum 250000
7052740 60.00 Cruisers Yachts CANTIUS 60 CANTIUS 1799900
7248686 55.00 Sea Ray 500 Sundancer 194900
6848896 50.00 Gibson 50 ft Cabin Yacht 109900
6543157 50.00 Azimut Atlantis 50 469000
6815988 49.00 Beneteau 49 GT 619900
6685114 48.00 Island Packet 485 495000
6187610 48.00 Homemade Chinese Junk 95000
4554897 46.00 Silverton 41 Convertible 89900
5906935 45.00 Harbor Master Coastal 450 96900
4726031 45.00 Hatteras 45 Convertible w 500 HRS 49500
6600075 45.00 Sea Force IX Crew/Supply 200000
6936293 44.00 Sea Ray 44 Sundancer 280000
6823219 41.00 Mako 414 CC 513245
6823517 41.00 Mako 414 CC Family Edition 568245
7078337 41.00 Jeanneau Sun Odyssey 40.3 115000
6549904 40.00 Carver 400 Cockpit Motor Yacht 69900
7082587 39.00 Sea Ray 390 Motor Yacht 159995
6881975 39.00 Sea Ray 390 Sundancer 209900
6961691 38.00 Rampage 38 Express 179000
6776793 38.00 Hatteras 38 Convertible 129900
6320404 38.00 Beneteau America Gran Turismo 38 319000
7263380 37.99 Helmsman Trawlers 38 Pilothouse 439000
7002541 37.92 Fountain 38 Express Cruiser 148000
6598802 36.58 Carver 350 Mariner 149000
7155420 36.00 Cruisers Yachts 3672 Express 89900
7053944 36.00 Hunter 356 69900
7229278 36.00 Trojan Tri Cabin 32000
6377605 35.58 Carver 3697 Mariner 39995
7121237 35.00 Everglades 350 179900
6583407 35.00 Sea Ray 350 Sundancer Coupe 359900
6195401 35.00 Trojan 10.8 Meter Sedan (SRG) 29500
7256495 35.00 Tiara Sovran LE 205000
6974632 35.00 Island Packet 35 98000
5969400 35.00 Formula 34 PC 450101
7015211 35.00 Cruisers Yachts 3575 Express 60000
6965839 35.00 Silverton 352 Motor Yacht 72900
7250448 8.00 Yamaha Boats Waverunner VX 9699
6579959 5.00 Mercury 250 Hp Optimax Pro Xs 9500
6219877 3.00 Yamaha Outboards F20 3299
7183109 1.00 Bennington 20SLV 20835
7170931 1.00 Bennington 23SPDXP 60509
7203394 1.00 Mastercraft X23 88995
7170441 1.00 Bennington 23SPDXP 60509
7166454 1.00 Bennington 21SSBX 45020
7170917 1.00 Bennington 21SSRCX 43006
7170193 1.00 Bennington 24RSRX1 - 10 WIDE 74596

It appears that the largest outlier above the upper fence is a data entry mistake: the Sea-Doo is in reality a small boat, so it should be excluded from the analysis. For outliers below the lower fence, all of the 1 foot boats are mistakes and the 3 foot and 5 foot listings are actually advertisements for engines, so they should also be excluded from our analysis. As for the large yacht outliers on the high side, it is a bit of a judgement call. These kinds of boats are often meant for ocean-going trips, while most of the smaller boats are sold to people who primarily use them on lakes or rivers. Whether to keep them depends on what type(s) of boats the dealer intends to sell.
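
A list like Table 6 can be produced by filtering on the two fences. The sketch below uses a small made-up data frame, not the boats data, and the names listings, lower_fence and upper_fence are hypothetical.

```r
# Hypothetical listings (illustration only, not the boats data)
listings <- data.frame(
  id        = 1:11,
  length_ft = c(1, 16, 18, 19, 20, 21, 22, 23, 25, 26, 76)
)

q1  <- quantile(listings$length_ft, probs = 0.25, names = FALSE)
q3  <- quantile(listings$length_ft, probs = 0.75, names = FALSE)
iqr <- q3 - q1

# Boxplot rule: fences sit 1.5 IQRs beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Keep only the rows that fall outside either fence
outliers <- listings[listings$length_ft < lower_fence |
                     listings$length_ft > upper_fence, ]
outliers
```

The rule only flags candidates. Whether each flagged row is a mistake or a genuine extreme value still has to be judged by looking at the listing, as in the discussion above.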

  • Create a graph of boxplot of length_ft by fuelType. What can you conclude from this display? Would any of these subgroups benefit from having length_ft re-expressed? Why or why not?

Figure 4: Distribution of length (ft) by fuel type

We can see here that larger boats are typically diesel powered while small boats may have a variety of fuel types. Combined with the data from the contingency table in the previous question, this indicates that few new large boats are for sale; most new boats for sale are probably of the smaller variety. It may be that large boats last for a long time while small boats do not, or that large boats are not primarily sold through this website.

As for reexpression, while some of the distributions are a bit skewed, none of them look like they would benefit significantly from a transformation.

Points of emphasis:

  • Boxplots well labeled
  • Proper calculation of 5 number summaries
  • Shows work for calculation of outliers
  • Makes a reasonable interpretation of the boxplot
  • Shows understanding of appropriate conditions for reexpression

Q6: The Normal distribution (5 points)

For boats under 20 feet, a rough rule of thumb is that the maximum weight capacity in kilograms of a boat is approximately the length in feet times the width (or beam) of the boat times 5. In a 2006 study of Americans, the average weight was approximately 70 kilos with a standard deviation of 11 kilos.

  • In formal notation, write the Normal model of human weight.

\(N(70, 11)\)

  • Select at least five boats that are under 20 feet from the database. Make a table of the boats.
Table 7: Five boats under 20 feet
ID Length (ft) Beam (ft) Make Model
6855924 10.30 3.67 Yamaha WaveRunner EX Deluxe
6908066 18.67 7.83 Nitro Z18
6967727 16.00 6.33 Tracker Super Guide V-16 SC
6598394 19.50 7.00 Tracker Grizzly 1860 CC Sportsman
6791674 11.60 4.00 Sea-Doo RXP-X 300

Selected boats will vary but this is one possible group

  • The title of the columns of the table should be z scores ranging from -2 to +2

  • Calculate what is the maximum number of passengers the boat can hold if all of the passengers each weigh the following z scores: (-2, -1, 0, 1, 2), show your calculations below your table.

Table 8: Boat capacity by \(z\) score
ID        -2  -1   0   1   2
6855924    3   3   2   2   2
6908066   15  12  10   9   7
6967727   10   8   7   6   5
6598394   14  11   9   8   7
6791674    4   3   3   2   2

The general formula is given by:

max.wt <- (length_ft * beam_ft) * 5
    
total.pax <- floor(max.wt / (70 + (z.score*11)))
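
As a worked check, the sketch below applies the formula to the five boats in Table 7. The lengths and beams are typed in from that table; the names boats5 and pax.wt are hypothetical helpers introduced for this example.

```r
# Lengths and beams copied from Table 7
boats5 <- data.frame(
  id        = c(6855924, 6908066, 6967727, 6598394, 6791674),
  length_ft = c(10.30, 18.67, 16.00, 19.50, 11.60),
  beam_ft   = c(3.67, 7.83, 6.33, 7.00, 4.00)
)

z.score <- c(-2, -1, 0, 1, 2)
pax.wt  <- 70 + z.score * 11   # passenger weight (kg) at each z score

max.wt <- boats5$length_ft * boats5$beam_ft * 5   # capacity (kg) per boat

# One row per boat, one column per z score; round down to whole passengers
total.pax <- floor(outer(max.wt, pax.wt, "/"))
dimnames(total.pax) <- list(as.character(boats5$id), as.character(z.score))
total.pax
```

For example, boat 6908066 has a capacity of 18.67 × 7.83 × 5 ≈ 731 kg, and 731 / 92 ≈ 7.9, so it holds 7 passengers at a z score of 2.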
  • If you are a boat maker, what cutoff would you set for boat passenger capacity and why?

On one hand, the manufacturer will want to advertise having as large a capacity as possible as a selling feature. On the other hand, to avoid any safety problems, you would want to err on the side of being conservative and not risk having a boat be overloaded. I think the safety issue is more important so I would select a \(z\) score of 2 as the cutoff.

  • Look up three of the boats in your table on the internet and write down what the manufacturer set as the recommended number of people on the boat. What z score cutoff does it seem the boatmaker made in each of the cases?

Answers will vary here. Full points as long as the math is correct.

Points of emphasis:

  • Boats selected correctly
  • Table shows appropriate answers
  • Formula shown is correct and calculations are accurate
  • Reasonable logic offered for cutoff choice
  • Appropriate research for manufacturer information

Part 2: Two variable analysis

Q7: Relationship between variables (15 points)

  • Make a scatterplot of price as a function of length_ft. Add a linear smoother to the plot and label any points you consider to be an outlier using geom_text() - the label for the outlier should print the observation’s id. If necessary, transform any variables.

Figure 5: Price vs. length in feet

Whether to transform this relationship is a judgement call. You should at least consider the issue; either using a lower-order transformation of price or leaving it untransformed is ok as long as you state your reasons. The answers below are based on the untransformed graph, but if you transformed it you also need to interpret your results thoughtfully.

  1. Do you think there is a clear pattern? Describe the association between price and length_ft.

There appears to be a positive linear relationship.

  • Direction - Positive
  • Form - Somewhat linear, though price appears to become more variable as length increases.
  • Strength - Low to medium; there are many observations that do not fit the linear pattern at low values of length.
  • Outliers - There are a few outliers that may be worth removing on further consideration.
  2. Find out the details of any outliers you have identified. Do you think the outlier(s) should be excluded from the analysis? Why or why not?
Table 9: Price vs. length in feet - outliers
ID Length (ft) Make Model Condition Price
7052740 60 Cruisers Yachts CANTIUS 60 CANTIUS new 1799900
7243489 120 Sea-Doo Spark used 4999
6985357 76 Sunseeker 76 Yacht new 1483185

From this table, we can see that the two expensive boats are both luxury yachts in new condition, so it is not clear that they are outliers that should be removed. The Sea-Doo is clearly a mistake in the data; this boat type is in reality much smaller. As we saw in Q5, the cases with a very small length are also mistakes and should be excluded.

  • Make a second graph excluding any outliers you have identified

Figure 6: Price vs. length in feet - no outlier
  3. What do you estimate the correlation to be, without using technology?

Any reasonable guess is ok here

  4. Check the conditions for correlation
  • Quantitative variables condition: both are quantitative
  • Straight enough condition: the relationship is more or less straight
  • No outliers condition: there are a few outliers that we cannot exclude, though they probably will not result in a big change in the estimate.
  5. Find and interpret the correlation coefficient for this relationship
cor(boats.no.outlier$price, boats.no.outlier$length_ft)
[1] 0.7176885
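
To illustrate why the no outliers condition matters, the sketch below computes the correlation on a tiny made-up dataset with and without one mistaken listing. The data frame toy and its values are hypothetical and are not taken from the boats data.

```r
# Hypothetical listings: five boats on a straight line plus one data-entry
# mistake (a cheap boat recorded as 120 ft long)
toy <- data.frame(
  length_ft = c(16, 18, 20, 22, 24, 120),
  price     = c(20000, 25000, 30000, 35000, 40000, 5000)
)

r_all   <- cor(toy$price, toy$length_ft)

# Drop the mistaken row and recompute
clean   <- toy[toy$length_ft < 100, ]
r_clean <- cor(clean$price, clean$length_ft)

c(r_all = r_all, r_clean = r_clean)
```

In this extreme toy case a single bad row flips a perfect positive correlation to a negative one. Real data will be less dramatic, but the direction of the effect is the reason to check outliers before reporting a correlation.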
  6. Interpret this graph. What is some useful information we could communicate to the client? What questions for additional investigation does this graph prompt for you?

For boats over a certain length, it appears a linear model may be hard to develop. Instead, it may be easier to focus on boats less than 20 feet long, as that appears to be the main type of boat and also has a clearer linear relationship. We may want to ask the boat dealer why this is the case.

  • Now, make a third graph of the same two variables but color it by condition. Add a linear smoother for each condition.

Figure 7: Price vs. length in feet by condition
  7. How does this graphical display change the interpretation you developed in your answer to part 6? Why do you think the relationship is structured like this? Explain.

We can see more clearly that actually the relationship is different between new and used boats. It seems possible that larger, more expensive boats lose their value more quickly than smaller boats when sold as used. This may indicate to the dealer that they should be careful buying larger boats to resell.
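
The difference between the two smoother lines can also be expressed as a slope per condition. The sketch below fits a separate linear model to each group of a made-up dataset; the values are hypothetical and chosen so the slopes are easy to read, and this is only one way to get numbers comparable to the per-group lines in the figure.

```r
# Hypothetical listings (illustration only, not the boats data)
toy <- data.frame(
  condition = rep(c("new", "used"), each = 3),
  length_ft = rep(c(16, 20, 24), times = 2),
  price     = c(30000, 40000, 50000, 20000, 25000, 30000)
)

# Fit price ~ length_ft separately within each condition and keep the slope
slopes <- sapply(split(toy, toy$condition), function(d) {
  coef(lm(price ~ length_ft, data = d))[["length_ft"]]
})
slopes
```

Here each extra foot adds 2500 dollars for new boats but only 1250 dollars for used boats. A gap of this kind in the real data would support the idea that larger boats lose value faster when sold used.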

Points of emphasis:

  • Good quality interpretation of the graphs
  • Graphs appropriately labeled
  • Careful consideration of the outliers
  • Use of the textbook definitions to describe relationships
  • Appropriate condition check for correlation

Q8: Putting it all together (15 points)

Through the analysis conducted in the previous section and through at least one additional investigation of your own (an additional graph or table that analyzes a different relationship or distribution than those asked about in the questions above, but that you think is meaningful and important to communicate to the client), write three paragraphs outlining what you think are the main findings of questions 1-7 plus your additional investigation. What would you recommend to your boss as to what types of boats sell for the most money? What information are we missing in this dataset that we would need to better understand boat price?

  • Analysis here can vary but must be at least two paragraphs
  • Should accurately summarize the information discovered by answering the previous questions
  • B-level answer will conduct a shallow additional analysis, A-level answer will show interesting additional analysis that builds on previous answers
  • Shows a good understanding of the limits of this dataset
  • Should be as precise as possible, don’t use general statements when you can be more specific